multiple speaker
Representation of perceived prosodic similarity of conversational feedback
Qian, Livia, Figueroa, Carol, Skantze, Gabriel
Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Sweden (0.04)
- Europe > Germany (0.04)
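To make the comparison described in the abstract above concrete, here is a minimal sketch of ranking a feedback triad by pairwise cosine similarity of mean-pooled speech representations. The random arrays stand in for frame-level features from a spectral or self-supervised encoder; this is an illustrative sketch, not the authors' actual pipeline.

```python
import numpy as np

def pool(frames: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level features (T, D) into a single utterance vector (D,)."""
    return frames.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def odd_one_out(triad_frames: list[np.ndarray]) -> int:
    """Given three feedback tokens (frame-level features), return the index of the
    token least similar to the other two -- the judgement a listener makes in a
    triadic comparison task."""
    vecs = [pool(f) for f in triad_frames]
    sims = [cosine(vecs[1], vecs[2]),   # similarity of the pair excluding token 0
            cosine(vecs[0], vecs[2]),   # ... excluding token 1
            cosine(vecs[0], vecs[1])]   # ... excluding token 2
    return int(np.argmax(sims))         # the token excluded from the most similar pair

# Example with random stand-ins for encoder features (e.g. one layer of a self-supervised model):
rng = np.random.default_rng(0)
triad = [rng.normal(size=(120, 768)) for _ in range(3)]
print(odd_one_out(triad))
```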
Building a Luganda Text-to-Speech Model From Crowdsourced Data
Kagumire, Sulaiman, Katumba, Andrew, Nakatumba-Nabende, Joyce, Quinn, John
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20 and 49. Although the generated speech is intelligible, it is still of lower quality than that of a model trained on studio-grade recordings. This is due to the insufficient data preprocessing applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can be improved by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to retain only recordings with an estimated MOS above 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
- Africa > Uganda > Central Region > Kampala (0.05)
- Africa > East Africa (0.04)
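A rough sketch of the clip-level preprocessing the abstract above describes: trim leading and trailing silence, denoise, and keep only clips whose estimated MOS exceeds 3.5. The enhancement and MOS models are left as placeholders, since the abstract does not name specific ones.

```python
import librosa
import soundfile as sf

MOS_THRESHOLD = 3.5  # keep only clips rated above this estimated quality

def enhance(y, sr):
    """Placeholder for a pre-trained speech-enhancement model (not specified in the abstract)."""
    return y

def estimate_mos(y, sr) -> float:
    """Placeholder for a non-intrusive, self-supervised MOS predictor."""
    return 4.0

def preprocess(in_path: str, out_path: str) -> bool:
    y, sr = librosa.load(in_path, sr=None)
    y, _ = librosa.effects.trim(y, top_db=30)   # drop silent portions at the start and end
    y = enhance(y, sr)                           # reduce background noise
    if estimate_mos(y, sr) <= MOS_THRESHOLD:     # discard clips with low perceived quality
        return False
    sf.write(out_path, y, sr)
    return True
```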
Detecting Check-Worthy Claims in Political Debates, Speeches, and Interviews Using Audio Data
Ivanov, Petar, Koychev, Ivan, Hardalov, Momchil, Nakov, Preslav
A large portion of society united around the same vision and ideas carries enormous energy. That is precisely what political figures would like to accumulate for their cause. With this goal in mind, they can sometimes resort to distorting or hiding the truth, unintentionally or on purpose, which opens the door for misinformation and disinformation. Tools for automatic detection of check-worthy claims would be of great help to moderators of debates, journalists, and fact-checking organizations. While previous work on detecting check-worthy claims has focused on text, here we explore the utility of the audio signal as an additional information source. We create a new multimodal dataset (text and audio in English) containing 48 hours of speech. Our evaluation results show that the audio modality together with text yields improvements over text alone in the case of multiple speakers. Moreover, an audio-only model could outperform a text-only one for a single speaker.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > Arizona > Maricopa County > Scottsdale (0.04)
- (3 more...)
- Media > News (1.00)
- Government > Regional Government > North America Government > United States Government (0.47)
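As an illustration of combining the two modalities discussed in the abstract above, here is a minimal late-fusion sketch: concatenate a text embedding and an audio embedding and classify check-worthiness. This is a generic baseline with arbitrary dimensions, not the authors' model.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late fusion of a text embedding and an audio embedding for binary
    check-worthiness classification (generic baseline, not the paper's model)."""
    def __init__(self, text_dim: int = 768, audio_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # check-worthy vs. not check-worthy
        )

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([text_emb, audio_emb], dim=-1))

# Usage with stand-in sentence and audio embeddings:
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```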
Tech Talk: How AI Is Serving the Restaurant Industry
As Chief Revenue Officer at HungerRush, Olivier Thierry is helping shape customer expectations around AI as the restaurant industry begins experimenting with it, he tells Spiceworks News & Insights' Technology Editor Neha Kulkarni. Restaurants have realized that adopting new technology will help them not only survive their challenges but also achieve results, he notes. From labor shortages to improving the customer experience, in this edition of Tech Talk, Olivier discusses how AI can overcome these challenges and allow restaurants to reduce human error. He also shares how natural language processing can interpret customer attitudes in phone orders and has a real place in understanding the customer experience. Olivier: The pandemic turned the restaurant industry upside down, and many of its setbacks are still being felt today.
Multi-speaker Text To Speech
Speech synthesis (text-to-speech, TTS) is the generation of a speech signal from written text; in a sense, it is the inverse of speech recognition. Speech synthesis is used in medicine, dialogue systems, voice assistants, and many other business applications. As long as there is a single speaker, the task of speech synthesis looks fairly straightforward at first glance. When several speakers come into play, the situation becomes more complicated and related tasks arise, such as voice cloning and voice conversion; these will be discussed further in the text.
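To make the multi-speaker setting concrete before going further: one common approach (a sketch under that assumption, not necessarily the exact design discussed later in the text) is to condition the acoustic model on a learned speaker embedding, for example by adding it to the text-encoder states so that one model can synthesize many voices.

```python
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    """Toy illustration: a text encoder whose output is conditioned on a
    per-speaker embedding, so a single model can produce several voices."""
    def __init__(self, vocab: int = 100, dim: int = 256, n_speakers: int = 6):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        x = self.text_emb(tokens)                          # (B, T, dim)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)  # broadcast speaker identity over time
        out, _ = self.rnn(x)                               # a downstream decoder would predict a spectrogram
        return out

enc = MultiSpeakerEncoder()
hidden = enc(torch.randint(0, 100, (2, 20)), torch.tensor([0, 3]))
print(hidden.shape)  # torch.Size([2, 20, 256])
```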
Text to Speech System for Multi-Speaker Setting
What would you want to do if you could generate the voice of your favorite celebrity? Before I get ahead of myself, let me clearly define the objective of this blog. Given text and some voice clips of the desired speaker (say, Beyonce), I want my AI to output an audio clip of Beyonce speaking the text that I input to the code. So essentially, this is the same Text To Speech (TTS) problem we saw earlier, but with an added constraint: the speech must be produced in a particular speaker's voice. In this blog, I share two methods that can accomplish this task and compare them at the end.
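Before comparing specific methods, here is a hedged sketch of the general recipe this kind of voice cloning tends to follow: derive a fixed-length speaker embedding from the reference clips and condition a multi-speaker TTS model on it. `embed_utterance` and `tts_synthesize` are placeholders for pre-trained models, not the particular systems discussed in this post.

```python
import numpy as np

def embed_utterance(wav: np.ndarray) -> np.ndarray:
    """Placeholder for a pre-trained speaker encoder (e.g. a d-vector/x-vector model)."""
    return np.random.default_rng(0).normal(size=256)

def tts_synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Placeholder for a multi-speaker TTS model conditioned on a speaker embedding."""
    return np.zeros(22050)

def clone_voice(text: str, reference_clips: list[np.ndarray]) -> np.ndarray:
    # Average the per-clip embeddings into a single voice "signature",
    # then synthesize the input text in that voice.
    embeddings = np.stack([embed_utterance(clip) for clip in reference_clips])
    voice = embeddings.mean(axis=0)
    voice /= np.linalg.norm(voice) + 1e-9
    return tts_synthesize(text, voice)
```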
Voice 'Fingerprint' Propels Speaker Recognition
The accuracy of automatic speech recognition has made significant gains in the last few years thanks to the advent of deep neural networks. But there's one area that has thwarted researchers: telling multiple speakers apart. Now a startup called Chorus says it has made a breakthrough in the matter through a technique it calls "voice fingerprinting." Speech recognition and computer vision arguably are the two computational challenges that have benefited the most from deep learning. Armed with huge training sets – including vast troves of photographs and digital recordings of voices – convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have given computers sensory perception that can almost rival humans' senses.
- North America > United States > California > San Francisco County > San Francisco (0.05)
- Asia > Middle East > Israel (0.05)
How to listen to live baseball games on an Amazon Echo
Now that baseball season is underway, one of the easiest ways to listen to the games is on an Amazon Echo or another Alexa device. With TuneIn Live or MLB At Bat, you can stream live broadcasts from any Major League Baseball game using voice commands. TuneIn's service costs $3 per month for Amazon Prime subscribers (or $4 per month for non-subscribers) and also includes news and live sports from other leagues. MLB's Gameday Audio service costs a one-time payment of $20 for the entire 2018 season. Audio streams are also included with an MLB TV Premium subscription, which offers live video broadcasts for out-of-market games and costs $25 per month or $116 for the season.
An AI has learned how to pick a single voice out of a crowd
Devices like Amazon's Echo and Google Home can usually deal with requests from a lone person, but like us they struggle in situations such as a noisy cocktail party, where several people are speaking at once. Now an AI that is able to separate the voices of multiple speakers in real time promises to give automatic speech recognition a big boost, and could soon find its way into an elevator near you. The technology, developed by researchers at the Mitsubishi Electric Research Laboratory in Cambridge, Massachusetts, was demonstrated in public for the first time at this month's Combined Exhibition of Advanced Technologies show in Tokyo. It uses a machine learning technique the team calls "deep clustering" to identify unique features in the "voiceprints" of multiple speakers. It then groups the distinct features from each speaker's voice together, allowing it to disentangle multiple voices and then reconstruct what each person was saying.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.27)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.27)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.96)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.59)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.59)
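The deep clustering idea described in the article above can be sketched as: map each time-frequency bin of the mixture spectrogram to an embedding, cluster the embeddings with one cluster per speaker, and use the resulting binary masks to pull each voice back out. The toy version below replaces the learned embedding network with a fixed random projection, purely to show the masking-by-clustering mechanics rather than the team's actual model.

```python
import numpy as np
from sklearn.cluster import KMeans

def separate(mixture_spec: np.ndarray, n_speakers: int = 2, emb_dim: int = 20):
    """Toy deep-clustering-style separation.

    mixture_spec: magnitude spectrogram of the mixture, shape (T, F).
    Returns one masked spectrogram per speaker. A trained network would produce
    the per-bin embeddings; here a fixed random projection stands in for it.
    """
    T, F = mixture_spec.shape
    rng = np.random.default_rng(0)
    project = rng.normal(size=(1, emb_dim))              # placeholder for the learned embedder
    embeddings = mixture_spec.reshape(-1, 1) @ project   # one embedding per time-frequency bin
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)
    masks = [(labels == k).reshape(T, F) for k in range(n_speakers)]
    return [mixture_spec * m for m in masks]             # apply each speaker's binary mask

sources = separate(np.abs(np.random.default_rng(1).normal(size=(100, 257))))
print([s.shape for s in sources])
```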